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Abstract 

We  study  one  of  the  basic  multicore  and  GPU  programming  models,  namely,  SPMD  (Single-Program 
Multiple-Data)  programs.  We  define  a  formal  model  of  SPMD  programs  based  on  interleaving  threads 
that  manipulate  global  and  local  arrays,  and  synchronize  via  barriers.  SPMD  programs  are  written 
with  the  intention  to  be  deterministic,  although  programming  errors  may  result  in  this  not  being  true. 
SPMD  programs  are  also  frequently  modified  toward  optimal  performance.  These  facts  motivate  us 
to  develop  methods  to  check  determinism  and  equivalence.  A  key  property  in  achieving  this  is  non¬ 
interference,  formulated  as  validity  of  logical  formulas  automatically  derived  from  the  program,  that 
imply  determinism.  Automatically  derived  post-conditions  can  be  used  to  check  equivalence  of  non¬ 
interfering  programs.  We  report  on  a  prototype  that  can  prove  non-interference  of  NVIDIA  CUDA 
programs. 


1  Introduction 

Writing  correct  programs  has  always  been  difficult,  and  a  large  part  of  computer  science  research  is  devoted 
in  developing  methods  to  assist  programmers  in  this  task.  Recently,  the  surge  of  parallel  computing  archi¬ 
tectures  such  as  multicores  has  brought  with  it  hopes  to  go  beyond  the  limits  of  Moore’s  law,  but  also  worries 
that  programming  will  become  harder  [5]. 

One  of  the  reasons  why  parallel  programming  is  difficult,  is  that  parallel  architectures  often  use  a 
multi-threaded,  shared-memory,  interleaving-based  programming  model.  This  results  in  inherently  non- 
deterministic  behavior,  which  is  hard  to  understand  and  debug.  This  has  led  some  researchers  to  claim 
that  threads  should  be  avoided  [23,  19].  Other  concurrency  models,  such  as  Kahn  Process  Networks  [16], 
ensure  deterministic  results  despite  process  interleaving.  Unfortunately,  most  multiprocessor  architectures 
widely  used  today  do  not  follow  such  models,  and  use  threads  instead.  What  is  worse,  the  semantics  of  these 
architectures  are  often  ambiguous  and  not  well-documented,  and  execution  sometimes  yields  unexpected 
results  [29]. 

The  goal  of  this  paper  is  to  develop  methods  that  help  programmers  build  correct  multi-threaded  pro¬ 
grams,  and  in  particular  programs  running  on  modern  graphics  processing  units  (GPUs),  such  as  the  NVIDIA 

*This  report  is  an  updated  version  of  [20],  with  major  addition  Section  7,  reporting  on  a  prototype  implementation  and 
providing  preliminary  ideas  on  how  to  handle  loops.  Thanks  to  Carlos  Coelho  for  useful  discussions.  Part  of  this  work  was 
done  at  Cadence  Research  Labs.  This  work  is  supported  by  the  Center  for  Hybrid  and  Embedded  Software  Systems  (CHESS) 
at  UC  Berkeley,  which  receives  support  from  the  National  Science  Foundation  (NSF  awards  ^^0720882  (CSR-EHS:  PRET)  and 
#0720841  (CSR-CPS)),  the  U.S.  Army  Research  Office  (ARO  #W911NF-07-2-0019),  the  U.S.  Air  Force  Office  of  Scientific 
Research  (MURI  #FA9550-06-0312),  the  Air  Force  Research  Lab  (AFRL),  the  State  of  California  Micro  Program,  and  the 
following  companies:  Agilent,  Bosch,  Lockheed-Martin,  National  Instruments,  Thales  and  Toyota. 
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GeForce  8  Series.  GPUs  enjoy  great  popularity  today,  as  a  result  of  offering  great  computing  power  at  rel¬ 
atively  low  cost  [28].  Motivated  by  this,  we  consider  the  GUDA  programming  model  [2],  used  in  NVIDIA’s 
GPUs. 

GUDA  is  based  on  the  Single  Program,  Multiple  Data  (SPMD)  parallel  computation  model,  where 
concurrent  threads  execute  the  same  code,  although  they  may  not  follow  exactly  the  same  execution  path. 
GUDA  is  free  from  some  of  the  plagues  of  parallel  programming:  for  instance,  it  does  not  provide  locks 
explicitly  (although  it  does  provide  barrier  synchronization).  On  the  other  hand,  GPU  programming  is 
difficult  because  of  another  reason.  A  “naive”  parallel  implementation  of  a  given  algorithm  is  in  most  cases 
non-optimal  in  terms  of  run-time,  i.e.,  runs  too  slow.  Thus,  a  significant  effort  is  spent  trying  to  optimize  the 
program  to  achieve  better  performance  [28].  This  is  done  by  exploiting  the  particularities  of  the  architecture. 
Although  no  general  rule  exists,  it  is  often  the  case  that  global-memory  accesses  are  very  expensive  and  thus 
need  to  be  reduced  to  a  minimum  so  that  they  do  not  create  a  bottleneck.  Moreover,  memory  bandwidth 
often  depends  on  how  memory  is  accessed,  that  is,  on  the  memory  access  patterns.  Subtle  modifications  in 
such  patterns  can  result  in  orders-of-magnitude  performance  improvements  [28,  2] . 

Optimizing  the  program  is  done  by  transforming  it  so  that  it  uses  the  specifics  of  the  underlying  platform 
optimally.  Gurrently,  these  transformations  are  done  “manually”,  since  automating  them  is  beyond  the 
reach  of  state-of-the-art  compilers.  Although  methodologies  and  guidelines  exist  to  help  programmers  (e.g., 
coalesced  global  memory  access  [28,  2]),  these  are  fairly  general  and  leave  a  large  gap  which  must  be  filled  by 
the  programmer’s  creativity  and  care.  This  is  a  difficult  and  error-prone  task  (a  simple  example  is  provided 
in  this  paper). 

In  this  paper  we  propose  methods  to  make  this  task  error- free.  In  particular,  methods  that  allow  the 
programmer  to  check  equivalence  of  two  programs:  the  program  before  the  transformation  and  the  one  after 
the  transformation.  This  does  not  guarantee  correctness  of  the  programs  per  se.  However,  the  original 
program  is  usually  a  straightforward  parallel  implementation  of  the  algorithm,  thus,  it  is  easier  to  check 
that  this  original  program  is  correct.  Then,  checking  equivalence  is  enough  to  guarantee  correctness  of  the 
optimized  versions  as  well.^ 

After  studying  publicly  available  GUDA  programs  [2],  it  has  come  to  our  attention  that  these  programs 
are  written  to  be  deterministic,  in  the  sense  that  their  final  result  does  not  depend  on  the  interleaving  order. 
It  is  not  surprising  for  programmers  to  want  to  write  deterministic  programs.  However,  determinism  by  no 
means  comes  for  free  in  GUDA.  It  is  achieved  by  ensuring  that  concurrent  threads  are  non-interfering,  in 
terms  of  the  variables  they  read  and  write.  Non-interference  turns  out  to  be  a  key  property  in  our  study, 
the  main  contributions  of  which  are  the  following: 

First,  we  introduce  a  simple  formal  model  for  SPMD  programs.  Second,  we  formally  define  determinism 
and  equivalence  of  such  programs.  Third,  we  propose  a  formal  notion  of  non-interference  and  show  that  it  is 
a  sufficient  condition  for  determinism.  Fourth,  we  propose  a  method  to  check  equivalence  of  non-interfering 
programs.  Our  definition  of  non-interference,  as  well  as  the  method  to  check  equivalence,  rely  on  checking 
validity  of  logical  formulas  that  can  be  automatically  derived  from  the  program.  Finally,  we  illustrate  our 
methods  throughout  the  paper  using  a  parallel  program  performing  array  inversion,  coming  from  the  GUDA 
benchmark  suite. 


2  Related  work 

Ghecking  program  equivalence  for  sequential  programs  has  been  studied  in  [31,  12].  There  is  a  large  body  of 
research  on  checking  correctness  of  parallel  programs  (e.g.,  see  [14,  21,  24,  17,  18,  22]  and  [27]  for  a  survey  of 
more  recent  work).  In  fact,  much  of  the  work  in  formal  verification  techniques  such  as  model  checking  [26,  9] 
has  been  in  part  motivated  by  the  additional  complexity  due  to  concurrency. 

Most  of  this  research,  however,  deals  with  quite  general  versions  of  the  verification  problem,  in  terms  of 
either  the  model  used  (for  instance,  general  threads  synchronizing  with  locks  or  similar  mechanisms),  or  the 
properties  that  need  to  be  checked  (which  can  be  specified,  for  instance,  using  some  general  formalism  such  as 

^  Similar  equivalence  checking  methods  are  part  of  the  standard  flow  in  circuit  design,  as  well  as  in  more  recent  methodologies 
such  as  model-based  design  (e.g.,  refining  a  Simulink  floating-point  model  to  a  fixed-point  model). 
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temporal  logic  [21]).  In  contrast,  the  SPMD  model  we  use  in  this  paper  is  restricted  (for  instance,  there  are 
no  locks,  only  barrier  synchronization),  and  we  focus  on  specific  properties:  non-interference,  determinism 
and  equivalence. 

The  interference- free  property  used  in  the  proof  framework  of  [24]  is  weaker  than  ours.  Ours  essentially 
guarantees  absence  of  races,  where  two  or  more  threads  access  the  same  memory  location  and  at  least 
one  access  is  a  write.  Races  have  been  heavily  studied  in  the  context  of  programs  with  synchronization 
mechanisms  such  as  locks.  Many  techniques  to  detect  races  that  are  not  “protected”  by  locks  have  been 
proposed,  both  static  (e.g.,  see  [3,  15])  and  dynamic  (e.g.,  see  [30]).  [11]  observes  that  this  notion  of  races  does 
not  capture  all  problematic  interactions  among  threads,  and  proposes  the  stronger  non-interference  property 
of  atomicity,  in  the  context  of  Concurrent  Java  [3].  The  fact  that  many  parallel  programs  are  written  to  be 
deterministic  has  been  observed  by  other  researchers  as  well  (e.g.,  see  [27]).  Currently,  attempts  are  being 
made  to  bring  determinism  to  mainstream  object-oriented  languages  (e.g..  Deterministic  Parallel  Java  [6]). 

[32,  25,  34]  study  verification  of  MPI  programs  [1].  MPI  is  based  on  message-passing  communication 
and  is  thus  different  from  the  SPMD  model  that  we  study  in  this  paper,  which  uses  shared  memory.  [32] 
are  interested  in  checking  equivalence  of  a  parallel  MPI  program  with  a  sequential  program.  Here  we  are 
interested  in  checking  equivalence  of  parallel  SPMD  programs. 

Non-interference  is  a  prominent  notion  in  computer  security  (e.g.,  see  [13]),  but  the  interpretation  there 
is  usually  that  information  does  not  flow  from  confidential  data  to  public  data. 

Non-interference  has  received  a  lot  of  attention  in  the  parallel  compilation  community,  in  particular  under 
the  general  problem  of  data  dependency  analysis  for  arrays  (e.g.,  see  [35]).  The  major  difference  of  this  body 
of  work  with  ours  is  that,  in  parallel  compilation,  the  problem  is  how  to  extract  parallelism  from  a  sequential 
piece  of  code  (with  loops  manipulating  arrays,  etc.),  whereas  here,  the  parallelization  has  been  performed 
by  the  programmer,  and  our  objective  is  to  prove  that  the  parallel  code  is  non-interfering. 

In  model-checking,  there  is  a  large  body  of  work  on  how  to  alleviate  state-explosion,  by  eliminating  redun¬ 
dant  interleavings  using  partial-order  reduction  (e.g.,  see  [33,  10]),  or  by  exploiting  symmetries  (e.g.,  see  [8]). 
However,  the  goal  there  is  not  to  use  non-interference  to  statically  ensure  determinism  and  equivalence. 

[4]  proposes  a  method  to  check  the  barrier-based  synchronization  patterns  of  SPMD  programs.  Incorrect 
barrier  synchronization  may  occur  when  barriers  are  executed  conditionally.  This  problem  does  not  arise  in 
our  model  where  barriers  are  assumed  to  be  unconditional.^ 


3  Background:  the  CUDA  programming  model 

There  are  obviously  many  different  types  of  concurrent  programs,  depending  on  the  parallel  architectures 
that  these  programs  are  meant  to  run  upon,  and  the  programming  model  that  they  use.  In  order  to 
facilitate  understanding  of  the  formal  model  we  present  in  Section  4,  we  provide  here  a  short  description, 
with  examples,  of  the  CUDA  model,  which  has  motivated  this  work. 

Parallel  architecture: 

CUDA  programs  are  meant  to  run  on  a  GPU,  which  typically  consists  of  a  host,  which  is  a  traditional  CPU, 
and  one  or  more  compute  devices,  which  are  massively  data-parallel  co-processors.  Each  device  consists  of  a 
set  of  cores  plus  some  global  memory,  which  can  be  accessed  by  all  cores.  Each  core  consists  of  a  processing 
element  (a  processor)  plus  some  local  memory. 

CUDA  programs: 

We  consider  in  this  paper  a  simple  class  of  CUDA,  where  a  program  consists  of  three  parts: 

global  array  declarations; 
thread  function  declaration; 
thread  spawning; 

^  [2]  states  that  “__syiicthreads ()  is  allowed  in  conditional  code  but  only  if  the  conditional  evaluates  identically  across  the 
entire  thread  block,  otherwise  the  code  execution  is  likely  to  hang  or  produce  unintended  side  effects.”  Conditional  barriers 
appear  in  only  3  out  of  57  examples  included  in  the  CUDA  SDK. 
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The  first  part  consists  of  a  list  of  declarations  of  arrays.  Memory  for  these  arrays  is  to  be  allocated  into  the 
global  memory  space  of  the  multi-core  device.  The  second  part  declares  a  thread  function,  to  be  executed 
by  each  thread  that  will  be  spawned  on  the  device.  Each  thread  function  is  a  piece  of  sequential  code  similar 
to  a  C  function.  The  third  part  consists  of  a  command  specifying  how  many  threads  to  spawn.  CUBA 
programs  are  more  general,  in  the  sense  that  they  are  general  sequential  programs  (e.g.,  written  in  C)  where 
thread  spawning  commands  can  appear  anywhere  in  the  code.  For  simplicity,  in  this  paper  we  consider  the 
restricted  class  above. 

An  example  of  a  CUBA  program  is  given  below. 

//  global  array  declaration 
float  A  [1024],  B[1024]; 

//  thread  function  declaration 

void  reversel (float*  Out,  float*  In,  int  M) 

int  i  =  noThreads  *  coreld  +  threadld; 

Dut[M-l-i]  =  ln[i]  ; 

} 

//  thread  spawning 

reversel  <<<1024>>>  (B,  A,  1024); 

The  program  declares  two  arrays  A  and  B:  they  are  to  be  stored  in  the  global  memory  of  the  device.  Then 
the  thread  body  is  declared  as  function  reversel,  where:  argument  M  is  the  size  of  the  arrays;  noThreads, 
coreld  and  threadld  are  parameters  (number  of  threads  per  core,  core  id,  and  thread  id,  respectively),  to 
be  instantiated  upon  execution,  for  each  thread.^  The  third  part  of  the  program  specifies  that  K  =  1024 
threads  must  be  spawned.  Conceptually,  a  programming  model  such  as  CUBA  gives  the  programmer  the 
impression  that  the  number  of  available  cores  is  unbounded,  thus,  K  can  be  arbitrarily  large.  If  the  number 
of  threads  per  core  (noThreads)  is  T,  then  conceptually  C  =  |"^]  —  noCores  cores  are  required  to  run  the 
program.^ 

Transforming  the  program  to  optimize  performance: 

The  array  reversal  application  can  be  rewritten  as  follows  (for  simplicity,  we  assume  K  =  C  ■  T): 

void  reverse2 (float*  Out,  float*  In) 

{ 

float  hoc [noThreads] ; 

int  i  =  coreld  *  noThreads  +  threadld; 

int  j  =  (noCores-l-coreld)  *  noThreads  +  threadld; 

int  k  =  noThreads  -  1  -  threadld; 

hoc [threadld]  =  ln[i]; 

_ syncthreads 0 ; 

Out  [j]  =  hoc  [k]  ; 

} 

The  main  idea  is  to  split  the  tasks  performed  by  the  threads  in  two  phases.  In  Phase  1,  threads  read  from  the 
input  global  array  and  store  values  in  a  local-memory  array  hoc.  A  separate  instance  of  hoc  is  allocated  at 

®  In  CUDA,  a  core  is  called  a  block  and  the  set  of  cores  is  called  a  grid.  Blocks  are  conceptually  arranged  in  the  grid  as  a  one-, 
two-,  or  three-dimensional  array.  Thus,  the  index  of  a  block  can  be  up  to  3-dimensional:  parameters  blockidx.x,  blockidx.y, 
etc.,  are  used  for  this  purpose.  Similarly,  threads  in  a  block  are  also  conceptually  arranged  in  ID,  2D,  or  3D  arrays.  This 
facilitates  programming  with  2D  and  3D  objects  that  are  frequently  used  in  computer  graphics.  For  simplicity,  we  consider 
single-dimensional  indices  in  this  paper.  However,  our  approach  directly  extends  to  multi-dimensional  indices  as  well. 

^  In  practice,  the  number  of  cores  in  a  given  device  may  be  smaller  than  [^].  Different  policies  could  be  used  in  such  a 
case.  One  such  policy  is  to  partition  the  set  of  threads  into  groups,  such  that  each  group  is  enough  to  run  on  the  available  set 
of  cores.  Then  the  groups  are  executed  in  sequence. 
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each  core.  In  Phase  2,  threads  copy  from  Loc  to  the  output  global  array,  and  in  the  process  of  doing  so  also 
reverse  the  order  of  the  values.  The  syncthreads  (barrier  synchronization)  command  ensures  that  Phase 
1  is  complete  when  Phase  2  starts.  The  new  program  achieves  better  performance  than  the  first  version, 
because  threads  access  global  memory  (array  B)  in  a  so-called  coalesced  manner:  see  [28,  2]  for  details. 

Looking  at  reverse2,  it  is  not  immediately  obvious  that  it  correctly  implements  array-reversal,  or  in 
other  words,  that  it  is  equivalent  to  the  “naive”  version  reverse  1.  In  fact,  even  in  such  a  simple  application, 
the  indices  i,  j  ,k  used  by  reverse2  are  sufficiently  complex  to  require  time  to  understand  the  logic  behind 
the  rewriting.  This  process  is  tedious  and  error-prone.  The  goal  of  this  paper  is  to  provide  tools  to  ensure 
that  nothing  goes  wrong,  that  is,  that  reverse2  is  equivalent  to  reverse  1. 


4  A  formal  model  of  SPMD  programs 

In  this  section  we  provide  a  formal  model  for  SPMD  programs.  This  model,  although  inspired  by  the 
CUBA  programming  model,  is  independent  and  can  be  used  in  other  similar  contexts  as  well.  For  reasons 
of  simplicity  in  exposition,  our  formal  model  makes  a  number  of  assumptions,  such  as  acyclicity  of  programs 
(no  loops).  Loops  are  handled  by  our  implementation  as  discussed  in  Section  7. 

A  SPMD  program  is  defined  to  be  a  tuple 


P={G,L,F) 

where  G  is  a  list  of  global  array  names,  each  with  a  type  and  size.  L  is  a  list  of  local  array  names,  each 
with  a  type  and  size.  F  is  an  automaton  formalizing  the  thread  function  of  the  program,  as  described 
below.  A  type  is  a  basic  type  such  as  boolean,  integer,  real.  The  size  of  an  array  A,  denoted  sz(A),  is  an 
arithmetic  expression  involving  constants  or  special  pre-defined  parameters  C  (number  of  processing  cores) 
and  T  (number  of  threads  per  core).  (In  the  CUBA  code  shown  in  Section  3,  C  and  T  are  represented  by 
noCores  and  noThreads,  respectively.)  Given  an  array  symbol  A,  the  size  of  A  is  denoted  sz(A)  S  N. 

The  automaton  F  modeling  a  thread  function  is  a  tuple 


F  =  {Q,qo,R) 

where  Q  is  a  finite  set  of  locations  (the  “control  states”  of  the  automaton),  qq  G  Q  is  the  initial  location, 
i?  is  a  set  of  program  transitions.  A  program  transition  is  a  tuple  {q,q',a),  also  denoted  q  q',  where 
q,q'  €  Q  are  the  source  and  destination  locations,  respectively,  and  a  is  either  a  condition  statement,  or  an 
assignment  statement,  or  the  special  sync  statement,  as  described  below.  A  program  transition  labeled  with 
a  condition  (resp.,  assignment)  statement  is  called  a  condition  (resp.,  assignment)  transition.  A  program 
transition  labeled  with  sync  is  called  a  sync  transition. 

Note  that  although  our  model  does  not  contain  explicit  local  (i.e.,  per  thread)  variables,  these  can  be 
easily  modeled  using  local  arrays. 

An  expression  can  be  of  the  following  forms:  a  constant,  such  as  0,  1.5,  true,  and  so  on;  one  of  the 
pre-defined  parameters  C,  T,  b  (representing  the  index  of  the  core  that  a  given  thread  is  running  on,  and 
ranging  from  0  to  C  —  1)  and  t  (representing  the  local  index  of  a  thread  in  its  core,  and  ranging  from  0  to 
T-1);  an  arithmetic  expression  of  the  form  e  -I-  e',  e  —  e' ,  etc.;  a  boolean  expression  of  the  form  e  >  e',  e  A  e', 
etc.;  or  an  array  expression  of  the  form  A[e],  where  A  is  an  array  name  in  G  or  L,  and  e  is  an  arithmetic 
expression  of  integer  type.  In  the  CUBA  code  shown  in  Section  3,  b  and  t  are  represented  by  coreld  and 
threadid,  respectively. 

A  condition  statement  is  a  boolean  expression.  An  assignment  statement  has  the  form  I  :=  e,  where  e  is 
an  expression  and  I  is  an  array  expression. 

Let  us  provide  an  example  of  an  SPMD  program.  This  example  models  an  array  reversal  program. 
We  first  model  the  “naive”  version  of  the  program  (with  function  “reversel”,  see  Section  3)  as  a  tuple 
Pi  =  {G,L^,F^),  with  G  =  {A[C  •  T],i3[C  •  T]},  =  0  (no  local  arrays),  and  F^  being  the  automaton 

shown  in  Figure  1  (top).  A\C.  ■  T]  denotes  an  array  of  length  C  •  T  (in  this  case  both  arrays  A  and  B  are 
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S[C  ■  T  -  1  -  T  ■  b  -  t]  :=  ^[T  ■  b  +  t] 


Loc[t]  :=  ^[T  •  b  +  t] 


sync 


B[{C  —  1  —  b)  •  T  +  t]  :=  Loc[T  —  1  —  t] 


Figure  1:  Thread  automata  (top)  and  (bottom). 


unidimensional).  The  program  implements  the  parallel  assignment  B[i]  :=  A[M  —  1  —  i],  for  z  =  0  to  M  —  1, 
where  M  —  C  ■  T.  Index  i  is  implemented  by  the  expression  T  •  b  +  t. 

A  second,  optimized  version  of  the  program  (using  function  “reverse2”,  see  Section  3)  can  be  modeled 
as  a  tuple  P2  =  {G,L^,F^),  with  G  same  as  for  Pi,  =  {Loc[T]},  and  F^  being  the  automaton  shown  in 
Figure  1  (bottom). 

It  is  not  at  all  trivial  to  see  that  the  alternative  implementation  is  equivalent  to  the  original  implemen¬ 
tation  of  array  reversal,  that  is,  produces  the  same  output  array  B  for  any  input  array  A.  Our  goal  in  this 
paper  is  to  devise  methods  to  check  that  the  two  SPMD  programs  are  indeed  equivalent. 

Structural  Assumptions: 

Let  P  =  (G,  L,  F)  be  a  SPMD  program.  We  assume  that  F  is  deterministic,  that  is,  there  is  no  location 
q  &  Q  such  that  q  has  more  than  one  outgoing  assignment  transitions,  or  both  assignment  transitions  and 
condition  transitions. 

We  also  assume  that  F  is  structurally  deadlock-free,  i.e.,  for  every  location  q,  if  all  outgoing  program 
transitions  from  q  are  labeled  with  conditions,  then  the  union  of  these  conditions  is  equivalent  to  true. 

We  also  assume  that  F  is  acyclic,  i.e.,  there  is  no  sequence  of  program  transitions  leading  from  a  given 
location  q  to  itself.  This  and  the  fact  that  Q  is  finite  implies  that  some  locations  will  have  no  outgoing 
program  transitions.  We  call  these  locations  final.  We  can  assume,  without  loss  of  generality,  that  there  is 
a  single  final  location. 

We  finally  assume  that  the  structure  of  F  is  as  illustrated  in  Figure  2,  namely,  F  is  a  chain  of  k  sub¬ 
automata,  linked  with  sync  transitions.  We  denote  this  as  F  =  Fi  ^  F2  ^  •  •  •  — >  F^.  Each  sub-automaton 
Fi  has  no  sync  transition.  Also,  each  Fi  where  i  <  k  has  a  unique  location  qs  and  a  unique  sync  transition 
{qs,q's,  sync),  such  that  q'g  is  the  “initial”  location  of  F^+i.  We  call  each  Fi  a  sync-segment.  In  the  examples 
of  Figure  1,  F^  consists  of  a  single  sync-segment  since  it  contains  no  sync  statement.  F^  consists  of  two 
sync-segments. 


Figure  2:  Structure  of  a  thread  automaton. 


We  classify  global  and  local  array  symbols  as  input  or  output.  A  global  array  A  is  an  input  array  if  it  is 
not  written  at  all  in  F,  that  is,  there  is  no  assignment  transition  in  F  of  the  form  A[e]  :=  e' .  A  is  an  output 
array  if  it  is  not  read  at  all  in  F,  i.e.,  there  is  no  assignment  transition  in  F  of  the  form  e  :=  e'  such  that  A 
appears  in  e' .  We  assume  that  all  global  arrays  can  be  classified  either  as  input  arrays,  or  as  output  arrays, 
but  not  both.  A  local  array  B  is  classified  as  input  or  output  with  respect  to  a  given  sync-segment  Fp.  B 
is  an  input  array  in  Fi  if  B  is  not  written  at  all  in  F^;  B  is  an  output  array  in  Fi  if  it  is  not  read  at  all  in 
Fi.  For  checking  equivalence,  we  will  assume  that  a  local  array  B  can  be  either  an  input  array  or  an  output 
array  in  Fi,  but  not  both.  Note  that  B  can  still  be  an  input  array  in  Fi  and  an  output  array  in  Fj  if  j  ^  i. 
Also  note  that  we  do  not  need  the  above  assumption  for  checking  non-interference  (Section  6).  For  example, 
in  both  F^  and  F^  (Figure  1)  global  array  A  is  an  input  array  and  global  array  B  is  an  output  array.  Local 
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array  Loc  is  an  input  array  in  the  first  sync-segment  of  and  an  output  array  in  the  second  sync-segment 
of 

Instantiation  and  Semantics: 

A  SPMD  program,  although  it  refers  to  parameters  C  and  T,  does  not  instantiate  these  parameters. 
Indeed,  in  principle,  a  SPMD  program  should  be  written  independently  of  the  actual  values  of  these  param¬ 
eters,  and  should  work  correctly  in  any  instantiation.  An  instance  of  a  SPMD  program  P  =  {G,L,F)  is 
represented  as  a  tuple  {P,C,T),  where  C,T  G  N  are  two  positive  integers,  representing  the  instantiation  of 
parameters  C  and  T,  respectively.  In  a  SPMD  program  instance  (P,  C,  T),  there  are  in  total  C  ■  T  threads 
running  concurrently,  each  executing  the  sequential  program  described  in  F.  The  set  {0, 1,  ...,C  •  T  —  1}  is 
called  the  set  of  global  thread  indices  for  {P,C,T),  denoted  X{C,T). 

In  a  SPMD  program  instance  (P,  C,  T),  there  is  a  single  copy  of  every  global  array  A  G  G.  On  the  other 
hand,  each  local  array  B  G  L  is  instantiated  G  times,  representing  the  fact  that  there  is  one  copy  of  B 
allocated  at  each  core.  Elements  of  an  array  A  are  indexed  from  0  to  sz(A,  C,  T)  —  1,  where  sz(A,  C,  T)  is  the 
integer  number  obtained  by  replacing,  in  sz(A),  C  by  (7  and  T  by  T,  and  evaluating  the  resulting  expression. 

Consider  an  expression  e:  it  generally  involves  global  or  local  memory  array  symbols,  constants,  and 
the  parameters  C,  T,  b,  t.  By  fixing  these  parameters  to  concrete  positive  integer  values  G,  P,  b,  t,  we  get  a 
parameter-free  expression,  that  is,  an  expression  involving  only  array  symbols  and  constants.  We  denote  this 
parameter-free  expression,  obtained  by  substituting  concrete  values  to  the  parameters,  by  e(C/C',  T /T,  b/6,  t/t), 
or  e{C,  T,  b,  t)  in  short. 

The  semantics  of  a  SPMD  program  instance  {P,G,T),  denoted  |P,  (7,  P],  is  defined  to  be  a  labeled 
transition  system  (LTS)  |P,  (7,  P]  =  (S',  ^o,— >)  where: 

S  is  the  set  of  states.  Each  state  s  €  S  is  a  partial  function  that  assigns  a  value  to  each  element 
of  every  global  array  A  G  G,  to  each  element  of  every  instance  B^  of  every  local  array  B  G  L,  where 
k  G  {0, 1, ...,  G  —  1},  and  to  every  instance  of  a  program  counter  variable  pc*  G  Q,  which  records  the  location 
of  thread  n,  where  n  G  X{G,T).  States  are  partial  functions  because  some  arrays  may  not  be  initialized 
(however,  we  will  enforce  initialization  below).  We  will  denote  by  s(v),  s(pc"),  s(A[z]),  etc.,  the  values  of  v, 
pc,  A[i],  etc.,  in  state  s.  If  s(pc”)  is  the  final  location  for  all  n  G  P((7,  P),  then  s  is  called  a  final  state. 

So  C  S  is  the  set  of  initial  states.  For  each  s  G  Sq,  and  for  every  n  G  P(C',  P),  we  have:  s(pc”)  =  qo-  Also, 
every  array  in  G  (resp.,  L)  assumes  one  of  the  possible  values  in  Gq  (resp.,  Lq).  Array  elements  can  have 
arbitrary  initial  values,  however,  we  will  assume  that  local  arrays  are  guaranteed  to  be  initialized  during 
execution  (see  below). 

^  is  a  set  of  labeled  transitions.  Each  transition  is  a  triplet  (si,  /3,  S2),  also  denoted  si  S2,  where  si,  S2 
are  states  and  [3  is  either  sync,  or  a  pair  (n,  a),  where  n  G  P((7,  P)  and  a  is  a  condition  or  an  assignment. 

Given  a  state  s  and  a  parameter-free  expression  e,  s(e)  denotes  the  value  of  e  at  state  s:  this  is  the 
value  obtained  by  replacing  all  sub-expressions  A[j\  of  e  by  s(A[j])  and  performing  any  arithmetic  or  logical 
operations  in  e.  We  say  that  a  boolean  expression  e  is  satisfied  at  s  if  s(e)  evaluates  to  true. 

When  e  is  not  parameter-free,  its  evaluation  generally  depends  not  only  on  the  state  s,  but  also  on  the 
global  thread  index  n  G  P((7, P)  (as  well  as  the  values  G  and  P,  of  course).  Given  n  G  I{G,T),  define  b(n) 
and  t(n)  to  be  the  quotient  and  the  remainder  of  the  division  respectively: 

n  =  b(n)  •  P -I- t(n).  (1) 

Let  us  now  define  the  rules  for  the  transitions  of  the  LTS  |P,  (7,  P].  First,  consider  the  case  si  S2, 
where  n  G  I{G,T).  In  this  case  we  adopt  the  usual  interleaving  semantics,  where  only  thread  n  moves  and 
all  other  threads  remain  at  the  same  location.  Let  pCj  =  Si(pc"),  for  i  =  1,  2.  Then,  F  must  have  a  program 
transition  (pcj^,  pc2,Q;)  and  one  of  the  following  must  hold: 

•  Either  a  is  a  condition  e  and  the  parameter-free  expression  e(C',  P,  b(n),  t(n))  is  satisfied  at  state  si. 
In  this  case,  the  values  of  all  variables,  except  pc",  remain  the  same  in  S2  as  in  si. 

•  Or  a  is  the  assignment  I  :=  e  where  I  is  some  array  expression  A[e'].  Let  v  =  Si(e((7,P,  b(n),t(n))), 
i.e.,  V  is  value  that  e  assumes  in  si  when  evaluated  by  thread  n.  Let  j  =  si(e'((7,  P,  b(n),t(n))).  Then 
S2  is  identical  to  si,  except  that  the  value  of  array  element  A[j]  in  S2  is  set  to  v,  and  S2(pc")  =  pc2. 
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Second,  consider  the  case  Si  S2-  In  this  case  all  threads  synchronize  and  move  simultaneously.  Then, 
F  must  have  a  transition  (pc^,  pc2,sync),  such  that  Vn  S  I{C,T)  :  si(pc")  =  pc^  A  S2(pc")  =  pc2.  (Notice 
that,  because  we  assume  that  the  thread  automaton  is  a  “chain”  of  sub-automata  linked  by  sync  transitions, 
it  is  not  possible  for  different  threads  to  synchronize  while  being  at  different  locations.)  The  value  of  all 
other  variables  except  program  counters  remains  unchanged. 

Our  semantics  assumes  that  assignments  are  atomic,  that  is,  they  cannot  be  interrupted  by  other  threads. 
This  assumption  may  seem  unrealistic,  especially  in  cases  where  the  expressions  involved  in  the  assignment 
are  long  (i.e.,  require  many  computation  steps),  involve  accesses  to  global  memory,  etc.  It  is  true  that  in  such 
cases  execution  of  these  assignments  may  not  be  atomic.  This  problem  can  be  overcome,  however,  during 
the  modeling  phase:  thread  automata  are  only  a  modeling  formalism,  not  a  programming  language.  When 
translating  from  a  programming  language  (e.g.,  such  as  CUBA)  to  thread  automata,  care  can  be  taken  to 
“split”  non-atomic  statements  into  sequences  of  atomic  ones. 

A  run  in  the  LTS  |P,  C,  T]  is  a  sequence  of  fc  >  0  transitions,  starting  at  an  initial  state: 


P=  So 


01  02 
'  Si  ^ 


0k 


Sk, 


where  sq  G  Sq,  and  Si  ^  Si+i  is  a  transition  of  |P,  C,  T],  for  i  =  0, ...,  A:  —  1.  We  say  that  the  run  p  reaches 
state  Sk  and  Sk  is  called  a  reachable  state.  The  set  of  all  reachable  states  of  P  with  respect  to  C,  T  is  denoted 
TZ{P,C,T).  The  run  p  is  called  maximal  if  Sk  is  a  final  state.  The  set  of  all  reachable  final  states  of  P 
with  respect  to  C,T  is  denoted  TZf{P,C,T).  Note  that  every  run  in  |P,  C,  T]  is  finite:  this  follows  from 
the  assumption  that  F  is  acyclic.  On  the  other  hand,  the  sets  TZ{P,C,T)  and  Tlf{P,C,T)  may  be  infinite, 
because  the  domains  of  state  variables  (arrays)  may  be  infinite. 

Assignment  Assumptions: 

Let  p  =  So  ^  Si  ^  ^  Sfc  be  a  run  in  the  LTS  |P,  C,  T]  and  let  A  be  a  global  or  local  array  of  P. 

Let  i  G  {!,...,  fc}  and  let  j  G  {0,  ...,sz(A,  C,T)  —  1}.  We  say  that  the  j-th  element  of  A  is  written  in  the 
i-th  transition  of  p,  if  f3i  =  {n,A[e]  :=  e')  and  Si_i(e(C,  T,  b(n),  t(n)))  =  j.  We  say  that  the  j-th  element 
of  A  is  read  in  the  f-th  transition  of  p,  if  j3i  =  {n,  I  :=  e)  and  e  contains  a  sub-expression  A[e'\  such  that 
s,_i(e'(C,r,  b(n),t(n)))  =  j. 

We  assume  that  |P,  C,  T]  satisfies  the  local  array  initialization  (LAI)  property.  Intuitively,  LAI  states 
that  every  local  array  element  is  initialized  before  used.  This  means  that  every  element  of  the  array  is 
written  (by  some  thread)  before  the  same  element  is  read  (by  the  same  or  possibly  some  other  thread). 
This  assumption  is  semantical:  it  must  hold  in  every  possible  execution  of  the  program.  Formally,  this  is 

expressed  as  follows.  Let  p  =  sq  ^  si  ^  ^  be  a  run  in  the  LTS  |P,  C,  T]  and  let  A  be  a  local  array 
of  P.  Then,  if  there  exists  j  G  {0, ...,  sz(A,  C,  P)  —  1}  and  i  G  {!,...,  A:}  such  that  A[j]  is  read  in  the  z-th 
transition  of  p,  then  there  exists  some  I  G  {1,  ...,z  —  1}  such  that  A[j]  is  written  in  the  Pth  transition  of  p. 

We  also  assume  that  |P,  C,  T]  satisfies  the  single  array  assignment  (SAA)  property.  Intuitively,  SAA 
states  that  every  element  of  a  global  or  local  output  array  is  assigned  exactly  once  in  every  execution  of  the 

system.  Formally,  this  is  expressed  as  follows.  Let  p  =  sq  ^  si  ^  ^  be  a  run  in  the  LTS  |P,  C,  P]  and 

let  A  be  a  local  or  global  array  of  P.  If  A  is  not  a  global  input  array,  then  for  all  j  G  {0, ...,  sz(A,  C,  T)  —  1}, 
there  must  be  exactly  one  i  G  {1, ...,  A;}  such  that  A[j]  is  written  in  the  z-th  transition  of  p. 

The  above  assumptions  are  not  generally  guaranteed  by  our  SPMD  model.  They  can  be  enforced,  however, 
by  conditions  similar  to  the  non-interference  condition  presented  in  Section  6.  The  details  are  omitted  due 
to  lack  of  space. 


5  Properties  of  interest 

Our  ultimate  goal  is  to  provide  a  method  for  proving  equivalence  of  SPMD  programs.  But  what  does 
equivalence  exactly  mean?  For  sequential  programs,  which  are  deterministic,  it  is  reasonable  to  define 
equivalence  as  follows:  programs  Pi  and  P2  are  equivalent  if,  given  the  same  inputs,  they  produce  the 
same  outputs.  This  definition  does  not  directly  apply  to  SPMD  programs,  because  the  latter  are  inherently 


non-deterministic:  the  outputs  of  a  SPMD  program  may  be  different  depending  on  the  particular  order  of 
thread  interleavings.  We  are  thus  motivated  to  define  determinism  first,  and  then  define  equivalence  for 
deterministic  programs. 

We  must  also  define  precisely  what  we  mean  by  “inputs”  and  “outputs” .  Usually,  in  GPU  applications, 
one  is  not  interested  in  the  values  of  local  arrays  or  other  local  variables,  but  only  in  the  values  of  global  arrays. 
Motivated  by  this,  we  introduce  the  following  equivalences.  Consider  a  SPMD  program  P  =  {G,L,F).  Let 
|P,  C,  T]  =  {S,  So,  for  given  C,T  S  N.  Two  states  s,s'  G  S  are  said  to  be  equivalent,  denoted  s  «  s',  if  for 

each  A  £  G,  for  any  i  £  {0,  ...,sz{A,G,T)  —  1},  s(A[j])  =  s'(A[i]).  Let  pi  =  sj  ^  s}  ^  and  p2  = 
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be  two  runs  in  \P,C,T\.  The  two  runs  are  said  to  be  equivalent,  denoted  pi  ~  p2,  if 
Sq  «  Sq  «  sf.^,  that  is,  assuming  all  global  arrays  have  the  same  value  when  the  programs  begin,  they 

will  have  the  same  value  when  the  programs  end.  The  two  runs  are  said  to  be  strongly  equivalent,  denoted 


Pi  -  P2,  if  4  =  So  ^  4i  =  42- 

The  above  definitions  extend  to  state  of  two  LTSs  coming  from  different  SPMD  programs  Pi  and  P2, 
with  potentially  different  instantiations  of  parameters  C,  T,  as  long  as  Pi  and  P2  have  the  same  set  of  global 
arrays.  We  will  use  this  to  define  equivalence  between  SPMD  programs  below.  For  simplicity,  we  will  assume 
that  Pi  and  P2  have  identical  sets  of  input  and  output  global  arrays:  that  is,  array  A  is  an  input  (resp., 
output)  array  in  Pi  iff  it  is  an  input  (resp.,  output)  array  in  P2.  We  will  also  assume  that  parameters  C,T 
are  instantiated  identically  in  the  two  programs.  Both  assumptions  can  be  lifted  without  compromising  the 
results  of  our  framework,  however,  this  would  make  the  presentation  heavier,  and  we  opt  for  simplicity. 
Determinism: 

Let  P  be  a  SPMD  program  and  let  (7,  T  G  N.  P  is  said  to  be  deterministic  with  respect  to  C,  T  if  for  any 
two  maximal  runs  p  and  p'  in  |P,  C,  T],  we  have  p  ~  p' .  If  p  ~  p'  then  P  is  said  to  be  strongly  deterministic 
with  respect  to  C,  T.  P  is  said  to  be  deterministic  (respectively,  strongly  deterministic)  if  it  is  deterministic 
(respectively,  strongly  deterministic)  with  respect  to  G,  T,  for  any  C,T  S  N. 

Program  Equivalence: 

Let  Pi  =  {G,Li,Fi)  and  P2  =  (G,L2,F2)  be  two  SPMD  programs  with  identical  sets  of  global  arrays. 
Let  C,  P  G  N.  Pi  and  P2  are  said  to  be  equivalent  with  respect  to  G,T,  denoted  Pi  ^c,t  P2,  if  (1)  Pi  is 
deterministic  with  respect  to  G,  T,  (2)  P2  is  deterministic  with  respect  to  C,  T,  and  (3)  for  all  maximal  runs 
Pi  in  |Pi,  C,  T]  and  p2  in  IP2,  G,  T],  we  have  pi  «  p2-  Let  P  be  a  subset  of  N,  representing  a  set  of  conditions 
on  parameters  G,T.  We  say  that  Pi  and  P2  are  equivalent  with  respect  to  P,  denoted  Pi  «r  P2i  if  for  all 
(C,  T)  G  P,  we  have  Pi  ^2- 


6  Analysis 

Our  goals  are  the  following:  (1)  to  check  whether  a  given  SPMD  program  is  deterministic,  and  (2)  to  check 
whether  two  deterministic  SPMD  programs  are  equivalent.  A  key  property  in  achieving  these  goals  is  non¬ 
interference,  which  roughly  states  that  different  threads  access  different  array  elements,  or  the  same  element 
but  at  different  times. 

Non-Interference: 

In  the  system  |P,  (7,  T],  there  are  G  ■  T  threads  running,  where  G  is  the  number  of  cores  and  T  the 
number  of  threads  per  core.  All  these  threads  may  access  the  same  locations  of  global  memory.  Moreover, 
for  each  core,  the  T  threads  running  on  that  core  may  access  the  same  location  of  local  memory  of  this  core. 
To  ensure  determinism,  we  need  to  ensure  that  no  race  conditions  occur  in  these  global  or  local  memory 
accesses.  Race  conditions  can  occur  when  two  threads  access  the  same  memory  location,  at  least  one  access 
is  a  write,  and  the  two  accesses  may  happen  in  any  order.  Non-interference  ensures  that  race  conditions  do 
not  occur. 

Let  F  be  the  thread  automaton  on  which  we  wish  to  ensure  absence  of  race  conditions  like  the  above. 
Because  of  the  chain-of-sub-automata  assumption  (Figure  2),  it  suffices  to  ensure  the  absence  of  race  con¬ 
ditions  separately  on  each  sync-segment  Fi  of  F.  Indeed,  from  the  fact  that  threads  must  synchronize  on 
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sync  transitions,  it  is  impossible  for  two  sync-segments  Fi,  Fj  with  i  ^  j  to  interfere:  ii  i  <  j  then,  in  any 
execution,  all  transitions  of  Fi  are  guaranteed  to  take  place  before  any  transition  of  Fj. 

Thus,  it  suffices  to  check,  for  each  sync-segment  Fi  of  F,  that  it  cannot  interfere  with  itself.  In  other 
words,  that  we  cannot  have  two  threads  executing  statements  of  Fi  that  interfere  with  each  other.  Notice 
that  Fi  is  a  special  case  of  a  thread  automaton,  without  sync  transitions,  except  for  the  transition  from  Fi 
to  the  next  sync-segment  Fi+i.  Then,  let  Fi  be  the  thread  automaton  {Q,qQ,R). 

We  define  the  following  two  sets  of  expressions: 

LHS(Ti),  called  the  set  of  all  left-hand  side  expressions  of  Fi,  is  defined  to  be  the  set  of  all  expressions  I 
such  that  I  :=  e  is  some  assignment  statement  of  Fi. 

RHS(Ti),  called  the  set  of  all  right-hand  side  expressions  of  Fi,  is  defined  to  be  the  set  of  all  array 
sub-expressions  of  an  expression  e,  such  that  either  I  :=  e  is  some  assignment  statement  of  Fi  or  e  is  some 
condition  statement  of  Fi.  An  array  sub-expression  of  e  is  a  sub-expression  of  e  which  is  also  an  array 
expression.  For  example,  if  e  =  A[3  -I-  i?[t]]  then  e  has  two  array  sub-expressions:  e  itself  and  S[t]. 

LHS  only  contains  array  expressions,  since,  by  definition,  in  every  assignment  I  :=  e,  I  is  an  array 
expression.  The  reason  we  include  only  array  expressions  in  RHS  is  because  only  array  expressions  can  be 
assigned  to,  thus,  only  such  expressions  can  interfere  with  each  other.  Although  we  could  have  included  all 
sub-expressions  in  RHS  without  affecting  the  results  given  below,  this  would  result  in  redundant  expressions 
in  RHS.  Note  that  LHS  and  RHS  are  finite  sets. 

Let  us  illustrate  the  definitions  of  LHS  and  RHS  on  our  running  example.  First,  consider  thread  automaton 
F^  (Figure  1,  top).  F^  has  no  sync  transitions,  therefore,  it  consists  of  a  single  sync-segment:  F^  itself.  We 
have: 

LHS(Fi)  =  {B[C-T-l-T-b-t]}  and  RHS(FI)  =  {A[T  •  b -k  t]}. 

Next,  consider  thread  automaton  (Figure  1,  bottom).  consists  of  two  sync-segments:  F^  =  F^^  — >  F|. 
We  have: 

LHS(F2)  =  {Foc[t]},  RHS(F2)  =  {A[T  ■  b  + 1]}, 

LHS(F|)  =  {B[{C  -  1  -  b)  •  T  -f  t]},  RHS(F|)  =  {Foc[T  -  1  -  t]}. 

We  next  define  two  set  of  potentially  interfering  expression  pairs  of  Fi.  The  set  £g{Fi)  is  defined  to 

be  the  set  of  all  (61,62)  such  that  there  exists  global  array  symbol  A  S  G  such  that  A[6i]  G  LHS(Fi)  and 

A[e2]  G  LHS(Fi)  U  RHS(Fi).  The  set  £i{Fi)  is  defined  to  be  the  set  of  all  (61,62)  such  that  there  exists 
local  array  symbol  B  G  L  such  that  F[6i]  G  LHS(Fi)  and  F[e2]  G  LHS(Fi)  U  RHS(Fi).  The  intuition  is  that 
two  threads  interfere  iff  there  exists  a  pair  of  potentially  interfering  expressions  (61,62)  such  that  ei  and 
62  evaluate  to  the  same  value  in  the  two  threads.  Notice  that  we  need  not  worry  about  expressions  of  the 
form  A[ei]  G  LHS(Fi)  and  F[62]  G  LHS(Fi)  U  RHS(Fi),  where  A  and  B  are  different  array  symbols.  This  is 
because,  even  if  ei  and  62  can  be  made  equal,  A  and  B  refer  to  different  locations  in  memory,  thus,  there  is 
no  possibility  for  races. 

Let  F  be  a  thread  automaton  such  that  F  =  Fi  ^  ^  F^.  Fix  C,T  G  N.  We  say  that  a  sync-segment 

Fi  is  non-interfering  with  respect  to  C,  T  if 

1.  for  every  expression  pair  (ci,  62)  G  £g{Fi),  the  following  formula  is  valid: 

V6i,62  G  {0,...,G-  l},Vti,t2  G  {0,...,F-  1}  : 

(61  yf  &2  V  fi  ^2)  =k  6i(G,  F,  6i,ti)  y^  62(G,  F,  62,  t2) 

2.  for  every  expression  pair  (ei,  62)  G  £i{Fi),  the  following  formula  is  valid: 

V6  G  {0, ...,  C  -  1},  VF,  t2  G  {0, ...,  F  -  1}  :  F  y^  F  ^  6i(G,  F,  b,  F)  yf  e2(G,  F,  b,  F) 

The  above  formulas  are  formulas  of  first-order  logic  with  equality,  with  array  symbols  considered  to  be  unary 
function  symbols. 

We  say  that  F  is  non-interfering  with  respect  to  C,T  if  for  all  i  G  {!,...,  fc},  Fi  is  non-interfering  with 
respect  to  C,  T.  We  say  that  Fi  is  non-interfering  if  it  is  non-interfering  with  respect  to  C,  T  for  all  C,T  G  N. 
We  say  that  F  is  non-interfering  if  for  all  i  G  {1, ...,  k},  Fi  is  non-interfering. 
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Theorem  1  Let  P  =  {G,L,F)  be  a  SPMD  program  and  let  C,  T  G  N.  If  F  is  non-interfering  w.r.t.  C,T 
then  P  is  strongly  deterministic  with  respect  to  C,  T . 

Proofs  can  be  found  in  Appendix  A. 

Let  us  apply  Theorem  1  to  show  that  the  SPMD  program  of  Figure  1  (top)  is  deterministic.  The  sets 
LHS(F"^)  and  RHS(F^)  have  been  given  above.  According  to  the  definition  above,  £g{F^)  =  {(e,  e)},  where  e 
is  C-T— 1  —  b-T  —  t,  and  £i{F^)  =  0.  To  show  non-interference,  we  must  prove  that  for  all  C, T  G  N,  for  all 
&i,  &2  S  {0, ...,  C  —  1}  and  for  all  ti,t2  G  {0, ...,  T  —  1}  such  that  bi  ^  62  or  h  ^  t2,  the  following  inequality 
holds: 

C  •  T  -  1  -  (61  •  T  +  ti)  ^  C  •  r  -  1  -  (&2  •  T  +  tz). 

This  follows  directly  from  the  assumptions.  Similarly,  we  can  show  that  the  alternative  array-reversal 
program  P2  with  thread  automaton  is  also  non-interfering.  consists  of  two  sync-segments,  Ff  and  F|. 
Following  the  definitions,  we  get:  £g{Ff)  =  0,  £i{Ff)  =  {(t,  t)},  £g{Ff)  =  {(e,  e)},  where  e  is  (C—  1  — b)  -T-l-t, 
and  £i{Ff)  =  0.  Then,  to  prove  that  F^  is  non-interfering,  we  show  the  two  facts:  ti  ^  t2  ^  ti  ^  t2,  and 

VC,TgN:V6i,&2  G  {0,...,C-l},Vti,t2  G  {0,...,F-1}  : 

(61  yf  62  V  F  7^  ^2)  (C*  ~  1  ~  bi)  ■  T  +  yf  (C  —  1  —  62)  ■  T  +  ^2- 

It  is  instructive  to  consider  a  third  implementation  of  array  reversal,  which  does  not  satisfy  the  non¬ 
interference  property.  This  happens  if  we  remove  the  sync  statement  from  thread  automaton  F^:  call  the 
resulting  thread  automaton  F^.  F^  has  a  single  sync-segment  (itself)  and  we  have: 

LHS(f3)  =  {Foc[t],  F[(C  -  1  -  b)  •  T  +  t]},  RHS{F^)  =  {A[b  •  T  + 1],  Foc[T  -  1  -  t]}. 

Then,  £i{F^)  includes  the  pair  (t,  T— 1— t)  and  we  can  no  longer  prove  the  implication  yf  F  F  y^  F— 1— F- 
In  fact,  the  implication  can  be  shown  to  be  false  simply  by  setting  F  =  0  and  ^2  =  T  —  1.  Thus,  F^  is 
interfering.  In  fact,  it  can  be  seen  that  this  implementation  is  non-deterministic,  and  incorrect. 

Checking  Equivalence: 

Let  Pi  and  P2  be  two  deterministic  SPMD  programs  with  identical  sets  of  global  arrays.  Let  F  be  a  subset 
of  N^,  representing  a  set  of  conditions  on  parameters  C,T.  We  represent  the  set  F  C  by  its  characteristic 
formula  (f>r-  the  latter  is  a  boolean  expression  on  parameters  C,T,  such  that  a  tuple  (C,  T)  G  is  in  F  iff 
it  satisfies  ^r-  We  want  to  check  whether  Pi  F2.  We  do  this  in  two  steps:  (1)  For  each  Pi,  i  =  1,2,  we 
compute  a  post- condition  d>p. .  The  latter  is  a  formula  that  relates  global  and  local  array  values  at  the  end 
of  program  execution.  (2)  We  check  whether  the  post-conditions  imply  equality  of  global  output  arrays.  We 
next  make  these  steps  precise  and  illustrate  them  on  our  running  example. 

Let  P  =  {G,L,F)  be  a  SPMD  program.  Let  11  denote  the  set  of  all  control-fiow  paths  in  F,  that  is,  all 
paths  from  the  initial  location  qq  of  F  to  some  final  location  (recall  that  F  is  acyclic,  therefore  11  is  a  finite 
set).  For  each  tt  G  11  we  will  compute  a  boolean  expression  Let  tt  =  qq  ^  qi  ^  pm-  Each  at  in  tt 

is  either  a  condition  statement,  that  is,  a  boolean  expression  e,  or  an  assignment  I  :=  e.  Define  ifi  to  be:  e, 

if  ai  is  the  condition  e,  and  Z  =  e  if  is  the  assignment  I  :=  e.  Then,  we  define  to  be  the  conjunction  of 

all  ipi,  and  the  post- condition  of  P,  denoted  <i>p,  to  be  the  disjunction  of  all 

'('tt  :=  /\  and  $p  :=  (j)^  (2) 

2=1,... ,m  TT^n 

In  all  ipi,  local  array  symbols  are  superscripted  by  b  (e.g.,  Loc  will  appear  as  Loc^).  This  is  because  there  is 
a  separate  copy  of  every  local  array  at  each  core,  and  we  need  to  refer  to  each  copy  individually. 

Let  us  return  to  our  running  example  of  array  reversal  programs  Pi  and  P2,  with  thread  automata  F^ 
and  F^,  respectively  (Figure  1).  The  post-conditions  for  Pi  and  P2  are: 

$P,  :=  F[C  •  T  -  1  -  (b  •  T  -b  t)]  =  A[b  •  T  -b  t] 

$P,  :=  Loc'^lt]  =  A[b  •  T  -b  t]  A  B[{C  -  1  -  b)  •  T  -b  t]  =  Loc°[T  -  1  -  t] 
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Theorem  2  Let  P  be  a  non-interfering  SPMD  program  w.r.t.  C,T  G  N.  For  all  s  G  TZf{P,C,T),  the 
following  formula  is  satisfied  at  state  s: 

VbG  {0,...,C-l},VtG  {0,...,T-1}  :$p(C,r)  (3) 

In  the  above  theorem,  ^p{C,T)  denotes  the  formula  obtained  by  replacing  variables  C,T  in  <i)p  by 
concrete  values  C,  T. 

Let  Pi  and  P2  be  two  SPMD  programs  having  the  same  set  G  of  global  arrays.  Recall  that  Pi  and  P2 
must  have  the  same  classification  of  global  arrays  into  input  and  output  arrays.  Let  Gout  C  G  be  the  set  of 
output  global  arrays.  For  i  =  1,2,  let  (pi  be  the  post-condition  formula  d>p. ,  with  the  addition  that  array 
symbols  in  Gout  are  labeled  by  superscript  L  That  is,  B  G  Gout  will  appear  as  in  pi  and  as  B^  in  p2- 
This  is  done  to  distinguish  the  outputs  of  the  two  programs.  All  local  array  symbols  that  are  common  to 
both  programs  are  also  superscripted  similarly.  Input  global  array  symbols  do  not  need  to  be  distinguished: 
in  fact,  by  having  the  same  input  symbols  in  both  formulas,  we  implicitly  encode  the  assumption  that  input 
array  values  are  the  same  for  both  programs. 

We  define  formulas  ppost  and  pout  as  follows: 

Ppost  ■=  Vb  G  {0,  ...,C- l},Vt  G  {0,  ...,T- 1}  :  A(/)2 

Pout  ■■=  /\  yj  G  {0,...,sz{B)  -  1} B^[j]  =  B^[j] 

BeGout 

We  represent  the  set  F  C  by  its  characteristic  formula  pp:  the  latter  is  a  boolean  expression  on 
parameters  C,T,  such  that  a  tuple  {G,T)  G  is  in  F  iff  it  satisfies  i^r- 

Theorem  3  Suppose  Pi  and  P2  are  non-interfering  SPMD  programs,  with  respect  to  any  C,  T  such  that 
{G,T)  G  F.  Then,  Pi  «r  P2  if  the  following  formula  is  valid: 

VC,T  G  N  :  {pr/\  Ppost)  =>  P  out 

Formula  (4)  instantiates  on  our  running  example  as  follows: 

VC,  T  G  N  :  (Vb  G  {0, ...,  C  -  1},  Vt  G  {0, ...,  T  -  1}  : 

Ri[C-T-l-(b-T-ht)]  =  A[b-T-Kt]  A  Loc'^[t]  =  A[b  •  T  + 1]  A 
R2[(C  -  1  -  b)  •  T  -h  t]  =  LoP^lT  -  1  -  t]) 

^VjG{0,...,C-T-1}:R1[j]=^"[j] 

A  proof  that  Formula  (5)  is  valid  can  be  found  in  Appendix  B. 

7  Implementation  and  experiments 

We  have  built  a  prototype  tool  that  can  automatically  check  non-interference  of  CUBA  programs.  Equiva¬ 
lence  checking  has  not  been  implemented  yet  in  the  tool,  but  the  non-interference  checking  functionality  is 
useful  independently,  and  not  available  in  other  tools,  as  far  as  we  know  at  the  time  of  writing. 

Our  tool  uses  OIL  (http://hal.cs.berkeley.edu/cil/)  to  parse  and  analyze  CUBA  programs.  The 
tool  then  generates  non-interference  conditions  that  are  submitted  to  the  Yices  SMT  solver  (http :  //yices . 
csl.sri.com/).  Yices  cannot  handle  non-linear  constraints,  therefore,  in  expressions  such  as  b-T-\-t,  where 
b  and  t  are  the  core  and  thread  ID  variables,  respectively,  and  T  is  the  number  of  threads  per  core,  we 
instantiate  T  to  a  constant.  Our  tool  can  handle  multidimensional  arrays. 

At  present  our  tool  can  run  on  the  reverse  1,  reverse2  programs  presented  in  this  paper  and  on  the 
following  programs  from  the  CUBA  SDK  suite  [2]:  clock,  nbody,  simpleZeroCopy  and  transpose.  All 
these  programs  are  proved  non-interfering  completely  automatically  in  <  1  sec.  Our  tool  currently  handles 
loops  with  statically  known  bounds  by  unrolling  the  loop.  This  works  for  the  programs  above  but  does  not 
work  for  all  programs.  For  example,  the  BlackScholes  CUBA  application  contains  the  following  thread 
function  pattern: 


(4) 


(5) 
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const  int  tid  =  noThreads  *  coreld  +  threadid; 
const  int  TN  =  noThreads  *  noCores ; 
for (int  i  =  tid;  i  <  N;  i  +=  TN) 

BlackScholesBodyGPU(A[i] ,  B[i],  ...); 

where  tid  is  computed  as  the  global  thread  index  T  •  b  +  t,  TN  =  C  •  T  is  the  total  number  of  threads,  A, 
B,  are  arrays,  N  is  the  size  of  these  arrays,  and  BlackScholesBodyGPU  is  the  function  the  performs  the 
computation. 

Such  cases  can  be  handled  by  adapting  the  non-interference  conditions  that  need  to  be  checked.  In 
particular,  we  can  generate  non-interference  conditions  of  the  form: 


V6i,62  e  {0,...,C'-l},Vti,t2G  {0,...,T-l},V^l,^2  gN: 

((6i  yi  &2  V  yf  t2)  A  (3fci,  fc2  G  N  :  ii  =  fci  •  TN  -p  A  12  =  *2  •  TN  -p  t2  A  <  N  A  t2  <  *2  <  N)) 


■  ei  (C.  T.  b. 


where  ii,J2  are  variables  corresponding  to  the  instantiation  of  the  loop  index  i  for  the  two  threads,  and 
variables  fci,  ^2  represent  loop  iterations,  ei,  62  are  left-hand  or  right-hand  side  expressions  potentially  using 
variables  ii,i2,  in  addition  to  variables  C,T,  and  so  on.  Then,  ej{C,T,bj,tj,ij),  for  j  =  1,2,  denotes  the 
expression  obtained  by  substituting  the  values  of  these  variables,  as  described  in  Section  6. 

It  is  worth  noting  that,  in  order  to  prove  that  the  above  non-interference  condition  is  valid,  it  suffices  to 
prove  that  the  following,  quantifier-free  formula,  corresponding  to  its  negation,  is  unsatisfiable: 


0  <  6i,&2  <  C  A  0  <  fi,t2  <  TA  (&i  y^  62  Vfi  y^  (2)  A  =  fci  •  TN -P  A  12  =  fca  •  TN -P  fa  A 

ti  ^  *1  <  NAta  ^  ^2  <  NAei(C,T,  5i,ti,Zi)  =  e2(C,T,  &2j^2;*2) 


Because  such  formulas  are  quantifier-free,  they  can  be  directly  handled  by  SMT  solvers.  For  instance,  to 
ensure  that  different  threads  don’t  write  to  the  same  A  [i]  element  in  the  BlackScholes  example,  it  suffices 
to  prove  unsatisfiability  of  the  above  formula,  where  ei{C,T,bi,ti,ii)  =  e2(C,  T,  627  (21  *2)  instantiates  to 
*1  =  *2-  Yices  takes  <  1  sec  to  prove  the  formula  unsatisfiable  for  C  =  T  =  256  and  N  =  256^. 

8  Conclusions  and  ongoing  work 

We  have  proposed  a  novel  framework  for  proving  determinism  and  equivalence  of  SPMD  programs.  Our 
framework  relies  on  a  notion  of  non-interference  requiring  that  different  threads  access  different  array  ele¬ 
ments,  or  the  same  element  but  at  different  times  (thanks  to  barrier  synchronization). 

We  are  currently  working  toward  strengthening  our  tool  so  that  it  can  handle  a  larger  set  of  CUBA 
programs.  Another  promising  direction  is  to  cast  the  framework  in  a  theory  of  arrays.  Even  though  features 
such  as  array  nesting  (A[i3[e]])  generally  result  in  undecidability  [7],  we  may  be  able  to  exploit  the  restricted 
form  of  formulas  used  in  our  framework  to  obtain  more  positive  results. 
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A  Proofs 

A.l  Proof  of  Theorem  1 

Define  the  run  p*  of  |P,  C,  T]  to  be  the  run  where  the  order  of  thread  interleaving  is  fixed  and  given  by  the 
global  thread  index,  such  that  for  every  n,  n'  G  T(C,  T),  n  <  n'  then  thread  n  must  execute  before  thread 

n'  in  p* .  In  other  words,  for  every  sequence  of  transitions  Si  S2  S3  in  P*^  we  have  n  <  n' .  Notice 
that  p*  is  uniquely  defined. 

We  will  show  that  for  any  other  run  p,  we  have  p*  ~  p.  We  will  do  this  by  transforming  p  to  a  run  p' , 
such  that  p'  ~  p,  and  p  and  p'  have  only  one  difference:  a  pair  of  successive  transitions  si  S2  S3  in  p 

TL  Oi  Th  CK 

such  that  n  >  n'  is  swapped  to  give  si  A-  S2  A-  S3  in  p'.  We  will  show  that  it  is  always  possible  to  do  this 
swapping  and  moreover  that  S3  =  S3.  Then,  it  should  be  clear  that  p  ~  p' .  By  repeatedly  applying  swapping, 
we  can  transform  p  to  p*.  Since  all  intermediate  runs  are  strongly  equivalent,  it  follows  that  p  ~  p*. 

To  show  that  swapping  is  always  possible,  we  distinguish  the  following  cases. 

Gase  (A):  a'  is  a  condition  statement,  i.e.,  a  boolean  expression  e.  In  this  case,  S2  and  S3  are  identical 
except  for  the  program  counter  of  thread  n'\  this  is  because  condition  statements  do  not  modify  arrays. 
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We  need  to  show  that  e  is  satisfied  at  state  si.  Then,  the  transition  si  s'2  exists,  and  s'2  is  identical  to 
Si  except  for  the  program  counter  of  thread  n' .  Thus,  the  transition  s'2  S3  also  exists,  and  S3  must  be 
identical  to  S3. 

Suppose  Si  does  not  satisfy  e.  On  the  other  hand,  we  know  that  S2  satisfies  e.  Then,  there  must  exist  some 
array  element  whose  value  changes  during  transition  si  S2.  This  means  that  a  must  be  an  assignment  of 
the  form  A[ei]  :=  63.  Moreover,  e  must  have  an  array  sub-expression  ^[62].  Finally,  it  must  be  that 

si(ei[C',T,  b(n),t(n)])  =  S2(e2[C',  T,  b(n'),  t(n')])  (6) 

that  is,  if  ei  evaluates  to  some  index  j  at  thread  n  then  62  evaluates  to  the  same  index  at  thread  n' .  Let 
&i,  &2,  ti,  O  be  such  that  n  =  bi-T  +  ti  and  n'  =  b2-T  +  ^2-  Also  note  that  both  the  assignment  A[ei]  :=  63 
and  the  condition  statement  e  must  be  statements  of  the  same  sync-segment,  say  Fi.  This  is  because  there 

is  no  sync-transition  (in  fact,  there  is  no  transition  at  all)  between  transitions  si  ^  S2  and  S2  -b  S3.  We 
distinguish  two  further  cases. 

Case  (A.l):  A  is  a  global  array.  Then  A[ei]  G  LHS(Fi)  and  A[e2]  G  RHS(Fi).  Thus,  (61,62)  G  £g{Fi). 
Then  Formula  (2)  is  not  valid.  Indeed,  n  ^  n'  implies  61  yf  62  Vti  yf  t2,  and  (6)  implies  that  ei[C,  T,  bi,ti]  = 
62[C',  T,  62,  ^2]  holds  in  the  logic  of  uninterpreted  functions.  This  contradicts  the  assumption  that  F  is 
non-interfering  w.r.t.  C,T. 

Case  (A. 2):  A  is  a  local  array.  Then  again  A[6i]  G  LHS(Fi)  and  A[e2]  G  RHS(Fi).  In  this  case, 
(61,62)  G  £i{Fi).  Then  Formula  (2)  is  not  valid.  Indeed,  A  is  a  local  array,  thus  there  is  a  separate  instance 
of  A  at  each  core  k  G  {0,  ...,(7  —  1}.  Thus,  n  and  n'  must  be  threads  running  at  the  same  core,  that  is, 
bi  =  &2-  This  and  n  ^  n'  imply  ti  yf  ^2,  and  (6)  implies  that  6i[C,  T,  61,  ti]  =  e2[C,  T,  61,  ^2]  holds  in  the 
logic  of  uninterpreted  functions.  Again  this  contradicts  the  assumption  that  F  is  non-interfering  w.r.t.  C,  T. 
This  completes  Case  (A). 

TX  Ci  TX  Ox 

Case  (B):  a'  is  an  assignment  statement  A[e]  :=  e' .  In  this  case,  both  transitions  Si  A-  s'2  and  s'2  A  Sg 
exist.  We  need  to  show  that  S3  =  S3.  Suppose  S3  yf  S3.  This  means  that  there  exists  some  array  A  and 
element  A\j]  such  that  S3(A[j])  yf  S3(A[j]).  There  are  two  cases:  either  A[j]  is  set  in  both  a  and  a' ,  or  it  is 
only  set  in  one  of  them,  and  the  other  modifies  a  value  used  in  the  first.  In  both  cases,  using  reasoning  similar 
to  the  above,  we  can  show  that  one  of  non-interference  formulas  (2)  or  (2)  is  invalid,  which  contradicts  the 
assumption  that  F  is  non-interfering.  The  details  are  omitted. 

A. 2  Proof  of  Theorem  2 

Suppose  s  G  TZf{P,C,T).  Suppose  (3)  is  not  satisfied  at  s.  Then  there  exist  b  G  {0,  ...,(7  —  1}  and 
t  G  {0, ...,  T  —  1}  such  that  for  any  control-fiow  path  tt,  (j)Tr[C,  T,  b,  t]  is  not  satisfied  at  s.  Let  n  =  b  ■  T  +  t. 
Let  p  be  a  maximal  run  p  starting  at  some  initial  state  Sq  reaching  s.  P  is  non-interfering  w.r.t.  (7,  T, 
therefore,  by  Theorem  1,  P  is  strongly  deterministic  w.r.t.  C,T.  This  means  that  we  can  assume  that  p  is 
such  that  thread  n  is  the  last  thread  to  execute,  after  all  other  threads  have  executed:  by  strong  determinism, 
p  will  still  reach  the  same  state  s. 

Let  TT  =  go  ^  ^  9m  be  the  control-fiow  path  that  thread  n  follows  in  p.  There  must  be  some 

i  G  {!,..., to}  such  that  il)i[C,T,b,t])  is  not  satisfied  at  s,  where  '0i  is  the  boolean  expression  obtained  from 
Ui-  We  distinguish  two  cases. 

Case  (A):  ai  is  an  assignment  statement  A[6]  :=  e' .  Then,  ipi  is  the  equality  A[e]  =  e'.  Let  j  = 
s{e[C,T,b,t])  and  v  =  s{e'[C,T,b,t])-  'ipi[C,T,b,t]  not  satisfied  at  s  means  s(A[j])  yf  v.  Since  tt  is  the 
control-fiow  path  that  thread  n  follows  in  p,  thread  n  must  execute  the  assignment  A[6]  :=  e' .  Therefore, 
p  must  have  a  transition  si  S2-  We  claim  that:  (1)  Si{e[C,T,b,t])  =  j  (which  means  that  A[j]  is 
written  in  this  transition)  and  (2)  si{e'\C^T,b,t])  =  v.  From  the  semantics  of  assignments,  (1)  and  (2)  imply 
S2(A[j])  =  V.  By  the  SAA  assumption,  A[j]  is  written  exactly  once  in  p,  therefore,  its  value  at  s  must  be 
the  same  as  its  value  at  S2-  Thus,  s(A[j])  =  v:  contradiction. 

We  proceed  to  prove  claims  (1)  and  (2)  above.  We  will  show  that  every  sub-expression  of  e  or  e'  has  the 
same  value  at  si  as  it  has  at  s.  Such  a  sub-expression  can  be  of  the  following  type: 
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•  A  constant:  obviously  it  always  has  the  same  value. 

•  A  parameter  among  C,  T,  b,  t:  because  these  parameters  are  substituted  by  the  same  values  (7,  T,  b,  t, 
respectively,  they  are  the  same  in  si{e[C,T,b,t])  and  s{e[C,T,b,t]). 

•  A  global  input  array  element:  input  arrays  are  never  written,  thus  they  maintain  a  constant  value 
throughout  a  run. 

•  An  element  of  a  writable  array,  say,  B[k].  By  the  LAI  assumption,  B[k]  must  be  written  before  it  is 
read,  therefore,  before  the  transition  si  S2-  By  the  SAA  assumption,  B[k]  is  only  written  once. 
Therefore,  B[k]  has  the  same  value  at  si  and  at  s. 

Thus,  claims  (1)  and  (2)  hold,  which  completes  the  proof  for  case  (A). 

Case  (B):  ai  is  a  condition  statement,  i.e.,  a  boolean  expression  e.  Then,  ipi  is  e.  il)i[C,T,b^t]  not 
satisfied  at  s  means  s(e[C',  T,  b,t])  =  false.  Since  tt  is  the  control-flow  path  that  thread  n  follows  in  p,  thread 
n  must  execute  the  condition  statement  e.  Therefore,  p  must  have  a  transition  si  ^  S2,  which  implies  that 
Si(e[C,  T,  6,  t])  =  true.  Following  a  reasoning  similar  to  the  above,  we  can  show  that  e[C',  T,  6,  t]  takes  the 
same  value  at  si  and  at  s:  contradiction.  This  completes  the  proof. 

A. 3  Proof  of  Theorem  3 

Suppose  Pi  P2-  Then  there  exist  C,  T  G  N  such  that  (C,  T)  G  F  and  Pi  ^c.t  Pi-  This  in  turn  means  that 
there  exist  maximal  runs  pi  G  |Pi,C,  T]  and  p2  G  |P2,C,T]  such  that  pi  ^  p2-  That  is,  pi  and  p2  start  at 
equivalent  initial  states  Sq  «  Sg  but  end  at  non-equivalent  final  states  si  9^  S2- 

(C,  T)  G  F  implies  that  (/>r  is  satisfied  by  C,  T.  We  will  show  that  (fpost  also  holds,  but  (font  does  not  hold. 
This  means  (4)  is  invalid. 

Si  76  S2  implies  that  there  exist  B  G  Gout  and  j  such  that  si{B[j])  yf  S2{B[j]).  We  will  show  that 
(fpost  A  B^[j]  ^  B'^[j]  is  a  satisflable  formula.  Suppose  it  is  not.  Then,  (jipost  implies  B^[j]  =  B'^[j].  In  the 
theory  of  uninterpreted  functions  this  means  that  if  states  s  and  s'  satisfy  ftpost  then  s(P[j])  =  s'{B[j]).  By 
Theorem  2,  si  and  S2  satisfy  4>posti  thus,  Si(i?[j])  =  S2{B[j]):  contradiction. 

B  Proving  equivalence  for  the  array  reversal  example 

As  shown  in  Section  6,  Formula  (4)  instantiates  on  our  running  example  as  Formula  (5).  To  prove  that  the 
two  array  reversal  programs  are  equivalent,  we  need  to  show  that  Formula  (5)  is  valid.  Suppose  this  is  not 
the  case.  Then  there  exist  C,  T  G  N  such  that 

(vbG{0,...,C-l},VtG{0,...,T-l}: 
pi  [C  •  T  -  1  -  (b  •  T  + 1)]  =  A[b  •  T  -f  t]  A 
Loc°[\\  =  A[b  •  T  -f  t]  A 
P2[(C  -  1  -  b)  •  T  +  t]  =  Loc'^[J  -  1  -  t]) 

holds  and 

VjG{0,...,C-T-l}:Pi[j]  =  p2[j] 

does  not  hold.  The  latter  implies  there  exists  j  G  {0, ...,  C  •  T  —  1}  such  that  B^[j]  P^[j].  We  can  And 

unique  bg  G  {0, ...,  C  —  1}  and  tg  G  {0, ...,  T  —  1}  such  that  j  =  6g  -T-l-tg.  Then,  pi[6g  -T -|-tg]  yf  P^[&g  -T -l-tg]. 
Let  6  =  C  —  1  —  6g  and  t  =  T  —  1  —  tg.  Then: 

C  ■  T  —  1  —  (&  ■  T  -|- 1)  =  60  ■  T  -|-  to  (9) 


(7) 


(8) 
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From  (7)  and  the  facts  bn,b  G  {0, C  —  1}  and  to,t  G  {0,  ...,T  —  1},  we  get  the  following  equalities: 


B^[C-T-l-{b-T  +  t)]  =  A[6-T  +  t] 

(10) 

Loc^[t]  =  A[b-T  +  t] 

(11) 

B^[{C-l-b)-T  +  to]  =  Loc^[T-l-to] 

(12) 

L 

B^[bo-T  +  to]  =  A[b-T  +  t] 

(13) 

fact  t  =  T  —  1  —  to,  we  get 

i?^[6o  •  T  +  to]  =  Loc^[t]  =  Loc^\T  —  1  —  to] 

(14) 

From  (14),  (12)  and  the  fact  b  =  C  —  1  —  bg,  we  get 

B^[bg  •  T  +  to]  =  B^[{C  —  1  —  &)  •  T  +  to]  =  B'^[bo  •  T  +  to] 

which  contradicts  our  assumption  B^[bo  •  T  +  to]  ^  i?^[6o  •  T  +  to].  Thus,  (5)  must  be  valid. 
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